Predictive Modeling of Weather Station Data:

Linear Regression vs. Graph Neural Network

Authors

Colby Fenters & Lilith Holland (Advisor: Dr. Cohen)

Published

July 30, 2025

Slides: slides.html


Introduction

Accurate weather prediction is a crucial task with widespread implications across agriculture, transportation, disaster preparedness, and energy management. Traditional forecasting methods often rely on statistical models or physics-based simulations; however, with the advancement of graph neural networks (GNNs), we believe there is potential in a more modern deep learning approach.

In this project, we explore the predictive power of a traditional linear regression model and a GNN on real-world weather station data. Our aim is to evaluate whether the GNN’s ability to incorporate spatial relationships between stations offers a measurable advantage over more conventional techniques.

The dataset consists of multiple weather stations located within the same geographic region. Each station collects meteorological variables over time, and can be represented as a node within a broader spatial network. For the linear model baseline, a single model will be trained using all stations’ data aggregated per feature for each time step.

For the GNN, the model will be trained on the entire network of stations, where each node corresponds to a station and edges represent spatial relationships. The graph is encoded via a dense adjacency matrix, excluding self-connections. The GNN aims to leverage the inherent spatial structure of the data, potentially capturing regional weather patterns and inter-station dependencies that are invisible to traditional models.

Our evaluation focuses on forecasting performance over the final six months of the dataset. We assess how well each modeling approach predicts key weather variables and investigate the conditions under which one model may outperform the other.

Methods

This section outlines the modeling approaches, data structure, and training procedures used to compare the traditional linear model and the GNN on weather station data.

1. Data Structure

The base dataset consists of 33 features across 8 stations over 96,408 hourly time steps, with intermittent sections of missing data. The key features of interest are as follows, with descriptions from (Herzmann 2023):

| Feature   | Description |
|-----------|-------------|
| station   | Station identifier code (3/4 characters) |
| valid     | Timestamp of the observation |
| lon       | Longitude of the station |
| lat       | Latitude of the station |
| elevation | Elevation of the station |
| tmpf      | Air temperature in Fahrenheit |
| relh      | Relative humidity in percent |
| drct      | Wind direction in degrees |
| sknt      | Wind speed in knots |
| p01i      | One-hour precipitation for the period from the observation time to the time of the previous hourly precipitation reset |
| vsby      | Visibility in miles |

While the dataset includes many other features, the remainder are of little interest for the purposes of this project.

2. Cleaning Process

The cleaning process for the data was a multistage procedure:

  1. Filling in all missing time steps for every station between the dataset's minimum and maximum timestamps

  2. Compressing the time steps from 1-hour intervals down to 6-hour intervals

  3. Removing any feature that was missing more than 10% of its values in more than 6 of the 8 stations

  4. Removing any station missing more than 2 years of valid data (some stations are newer than others)

  5. Performing a correlation analysis on each feature, both between nodes and within nodes

  6. Scaling the remaining features with appropriate scalers

  7. Transforming the remaining dataframe into an array of shape (time, station, feature)
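
Step 7 above pivots the long-format station data into a (time, station, feature) array. A minimal sketch with plain Python records and NumPy (the actual pipeline presumably operates on the full dataframe; the column names follow the feature table above):

```python
import numpy as np

def to_array(records, features):
    """Pivot long-format records (station, valid, feature values) into a
    (time, station, feature) array, as in cleaning step 7."""
    stations = sorted({r["station"] for r in records})
    times = sorted({r["valid"] for r in records})
    s_idx = {s: j for j, s in enumerate(stations)}
    t_idx = {t: i for i, t in enumerate(times)}
    out = np.full((len(times), len(stations), len(features)), np.nan)
    for r in records:
        out[t_idx[r["valid"]], s_idx[r["station"]]] = [r[f] for f in features]
    return out

# toy example: 2 stations x 2 six-hour steps x 1 feature
records = [
    {"station": "GCK", "valid": 0, "tmpf": 9.3},
    {"station": "LBL", "valid": 0, "tmpf": 12.3},
    {"station": "GCK", "valid": 1, "tmpf": 10.0},
    {"station": "LBL", "valid": 1, "tmpf": 13.0},
]
arr = to_array(records, ["tmpf"])
print(arr.shape)  # (2, 2, 1)
```

Missing (time, station) combinations stay as NaN, which is what the later imputation steps fill in.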

3. Linear Model

The linear model is formulated as a time-series regression task. It uses the feature information from the previous twenty-eight time steps (seven days at six-hour intervals) to predict the feature values at the next time step. Each input consists of an aggregation of each node's values for the five meteorological features, resulting in a fixed-length input vector per prediction target.

\(\hat{x}_{t+1}=f(\text{features}_t,\ \text{features}_{t-1},\ \dots,\ \text{features}_{t-27})\)

\(\text{features}_t=(\text{tmpf},\ \text{relh},\ \text{sknt},\ \text{drct}_{\sin},\ \text{drct}_{\cos})\)

The five input features are:

  • Temperature
  • Relative Humidity
  • Wind Speed
  • Wind Direction (represented as sine and cosine components)
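
A sketch of this setup, using ordinary least squares in place of whatever linear solver was actually used, and assuming the per-feature aggregation across stations is a mean (the report only says features were aggregated per feature for each time step):

```python
import numpy as np

def make_windows(series: np.ndarray, lags: int = 28):
    """Flatten the previous `lags` time steps into one fixed-length
    input vector per next-step prediction target."""
    T = series.shape[0]
    inputs = np.stack([series[t - lags:t].ravel() for t in range(lags, T)])
    targets = series[lags:]
    return inputs, targets

# toy stand-in for the cleaned (time, station, feature) array
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 7, 5))     # 40 steps, 7 stations, 5 features

# aggregate each feature across stations (assumed: mean)
agg = X.mean(axis=1)                # (time, feature)
A, y = make_windows(agg, lags=28)   # A: (12, 140), y: (12, 5)

# ordinary least squares: one coefficient vector per target feature
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
```

Note that the 140-dimensional input (28 lags × 5 features) matches the input width of the GNN described in the next section.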

4. GNN

The GNN is designed to capture spatiotemporal dependencies in the weather station network. It is implemented using PyTorch and follows a structure inspired by the Diffusion Convolutional Recurrent Neural Network (DCRNN) architecture.

  • Architecture
    • Input Format:
      • Data is structured using the StaticGraphTemporalSignal format
      • Data is split such that:
        • Testing = last 6 months (730 six-hour time steps)
        • Training = Head 80% of the remaining data
        • Validation = Tail 20% of the remaining data

Each node represents a weather station and temporal sequences of node features are used for prediction.

  • Layers:
    • A DCRNN layer to capture spatial and temporal dependencies
    • A ReLU activation function
    • A Linear output layer for final prediction
    • The full structure is:
      • DCRNN(140, 64) -> ReLU -> DCRNN(64, 32) -> ReLU -> DCRNN(32, 32) -> ReLU -> Linear(32, 1)
  • Training Configuration
    • Optimizer
      • Adam
  • Learning Rate:
    • Base learning rate of 0.01, reduced by a factor of 10 on plateau, down to a floor of 1e-5
  • Epochs:
    • Trained for a maximum of 100 epochs with an early-exit-on-plateau callback

The model is trained to predict the temperature for the next time step based on the preceding twenty-eight time steps, analogous to the linear model.
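
The plateau-driven learning-rate schedule and early exit described above can be sketched in plain Python. In practice this is what `torch.optim.lr_scheduler.ReduceLROnPlateau` provides; the patience value below is an assumption, as the report does not state one:

```python
class PlateauSchedule:
    """Sketch of the training schedule described above: start at lr=0.01,
    multiply by 0.1 whenever validation loss plateaus, floor at 1e-5,
    and stop early once the loss stops improving at the floor."""
    def __init__(self, lr=0.01, factor=0.1, min_lr=1e-5, patience=5):
        self.lr, self.factor, self.min_lr, self.patience = lr, factor, min_lr, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one validation loss; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            if self.lr <= self.min_lr:
                return True            # lr already at floor: early exit
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return False

sched = PlateauSchedule()
for loss in [1.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]:
    if sched.step(loss):
        break
print(sched.lr)  # lr reduced by a factor of 10 after the plateau
```

In the real training loop, `sched.step(val_loss)` would be called once per epoch after evaluating on the validation split.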

Analysis and Results

Data Exploration and Visualization

1. Source:

The data used in this project is sourced from the Iowa Environmental Mesonet at Iowa State University (Herzmann 2023), where weather station data is aggregated from various sources based on the standards shown in (Administration 2021). For our specific use case we selected data over a ten-year period, 2010 to 2020, for nine Kansas nodes. These nodes were selected primarily for their proximity and location.

A preview of the raw dataset (shape: 5 × 32) shows the identifying columns (station, valid, lat, lon, elevation) alongside the meteorological readings (tmpf, dwpf, relh, drct, sknt, p01i, alti, mslp, vsby, gust, sky cover and ceiling levels, wxcodes, ice accretion, peak wind, feel, snowdepth); most of these readings are null in the sampled rows.

2. EDA

Once the nine nodes were selected, further analysis was required both to reduce the volume of data and to ensure its quality, since the source itself is very clear that the provided data can be of poor quality.

This reduction was done by first finding which features were missing no more than 10% of their values, and then finding which nodes, for a given year range, were missing no more than 10% of their values. During this step one node was dropped because it was introduced after 2020 and thus had no valid data for the selected date range.

With this visual, a date range of 2018 to 2020 was selected, as this range had the most valid features and stations while also being quite recent. With this range selected the ULS station was dropped, resulting in seven valid stations.

shape: (30_667, 12)
┌─────────┬─────────────────────┬───────────┬─────────────┬─────────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┬───────────┐
│ station ┆ valid               ┆ lat       ┆ lon         ┆ elevation   ┆ tmpf      ┆ dwpf      ┆ relh      ┆ sknt      ┆ feel      ┆ drct_sin  ┆ drct_cos  │
│ ---     ┆ ---                 ┆ ---       ┆ ---         ┆ ---         ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ cat     ┆ datetime[μs]        ┆ f32       ┆ f32         ┆ f32         ┆ f32       ┆ f32       ┆ f32       ┆ f32       ┆ f32       ┆ f64       ┆ f64       │
╞═════════╪═════════════════════╪═══════════╪═════════════╪═════════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ "GCK"   ┆ 2018-01-01 00:00:00 ┆ 37.927502 ┆ -100.724403 ┆ 881.0       ┆ 9.283334  ┆ -6.5      ┆ 48.148335 ┆ 8.666667  ┆ -4.161667 ┆ 0.851117  ┆ 0.524977  │
│ "LBL"   ┆ 2018-01-01 00:00:00 ┆ 37.044201 ┆ -100.9599   ┆ 879.0       ┆ 12.316667 ┆ -1.983333 ┆ 52.014999 ┆ 10.166667 ┆ -1.721667 ┆ 0.664796  ┆ 0.747025  │
│ "EHA"   ┆ 2018-01-01 00:00:00 ┆ 37.000801 ┆ -101.879997 ┆ 1099.0      ┆ 15.555555 ┆ 5.255556  ┆ 63.382778 ┆ 7.388889  ┆ 4.482222  ┆ 0.970763  ┆ 0.24004   │
│ "HQG"   ┆ 2018-01-01 00:00:00 ┆ 37.163101 ┆ -101.370499 ┆ 956.52002   ┆ 14.311111 ┆ -1.605556 ┆ 48.468334 ┆ 7.777778  ┆ 2.681667  ┆ 0.936332  ┆ 0.351115  │
│ "3K3"   ┆ 2018-01-01 00:00:00 ┆ 37.991699 ┆ -101.7463   ┆ 1005.700012 ┆ 13.1      ┆ -0.9      ┆ 53.127777 ┆ 6.777778  ┆ 2.151111  ┆ 0.981255  ┆ 0.192712  │
│ …       ┆ …                   ┆ …         ┆ …           ┆ …           ┆ …         ┆ …         ┆ …         ┆ …         ┆ …         ┆ …         ┆ …         │
│ "EHA"   ┆ 2020-12-31 00:00:00 ┆ 37.000801 ┆ -101.879997 ┆ 1099.0      ┆ 42.355556 ┆ 16.588888 ┆ 34.955555 ┆ 3.0       ┆ 40.254444 ┆ -0.533205 ┆ -0.845986 │
│ "HQG"   ┆ 2020-12-31 00:00:00 ┆ 37.163101 ┆ -101.370499 ┆ 956.52002   ┆ 40.722221 ┆ 13.255555 ┆ 32.23111  ┆ 1.666667  ┆ 39.450001 ┆ 0.977334  ┆ -0.211704 │
│ "3K3"   ┆ 2020-12-31 00:00:00 ┆ 37.991699 ┆ -101.7463   ┆ 1005.700012 ┆ 40.200001 ┆ 12.2      ┆ 31.377777 ┆ 4.555555  ┆ 36.683334 ┆ -0.700217 ┆ -0.71393  │
│ "JHN"   ┆ 2020-12-31 00:00:00 ┆ 37.578201 ┆ -101.7304   ┆ 1012.710022 ┆ 40.711113 ┆ 17.4      ┆ 38.554443 ┆ 4.777778  ┆ 37.06889  ┆ -0.824675 ┆ -0.565607 │
│ "19S"   ┆ 2020-12-31 00:00:00 ┆ 37.496899 ┆ -100.832901 ┆ 892.570007  ┆ 39.922222 ┆ 15.388889 ┆ 36.552223 ┆ 6.111111  ┆ 35.113335 ┆ -0.824675 ┆ 0.565607  │
└─────────┴─────────────────────┴───────────┴─────────────┴─────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘

3. Graph Creation

Once the valid date range was selected and all features were reduced to only those that could be imputed, the graph needed to be generated.

As the number of nodes was low, it made sense to generate a dense graph without self-loops. This was simply done by looping over every pair of nodes and producing an adjacency matrix. The matrix was then weighted by inverse distance and scaled with a min-max scaler to ensure the values all ranged from 0 to 1.
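
A minimal sketch of that adjacency construction, assuming Euclidean distance on the station coordinates (the report only says "inverse distance"):

```python
import numpy as np

def adjacency(coords: np.ndarray) -> np.ndarray:
    """Dense adjacency from station coordinates: inverse pairwise distance,
    min-max scaled to [0, 1], zero on the diagonal (no self-loops)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    n = len(coords)
    off = ~np.eye(n, dtype=bool)          # mask of off-diagonal entries
    w = np.zeros_like(dist)
    w[off] = 1.0 / dist[off]              # inverse distance weights
    lo, hi = w[off].min(), w[off].max()
    w[off] = (w[off] - lo) / (hi - lo)    # min-max scale to [0, 1]
    return w

# three of the Kansas stations' (lat, lon) pairs, rounded
coords = np.array([[37.93, -100.72], [37.04, -100.96], [37.00, -101.88]])
A = adjacency(coords)
```

The resulting matrix is symmetric, so it can be used directly as edge weights for an undirected station graph.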

4. Graph Imputation

Once the graph was created, the data needed to be imputed both spatially and temporally. This was done through a two-step process: the data was first imputed over each time slice considering only the spatial information, and then a second time temporally, considering only each station's own series. While this process is not perfect, it was able to accurately impute data for the spatiotemporal graph structure with no apparent issues, as shown below.
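
One plausible realisation of the two-step scheme, sketched for a single feature; the actual imputers used in the pipeline are not specified in the report:

```python
import numpy as np

def impute(x: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Two-step imputation sketch for one feature, shape (time, station).
    Step 1 (spatial): fill a gap with the adjacency-weighted mean of the
    other stations at the same time step. Step 2 (temporal): linearly
    interpolate anything still missing along each station's own series."""
    x = x.copy()
    # spatial pass over every missing (time, station) entry
    for t, s in zip(*np.where(np.isnan(x))):
        obs = ~np.isnan(x[t])
        if obs.any():
            w = A[s, obs]
            if w.sum() > 0:
                x[t, s] = (w * x[t, obs]).sum() / w.sum()
    # temporal pass along each station's series
    for s in range(x.shape[1]):
        col = x[:, s]
        nans = np.isnan(col)
        if nans.any() and (~nans).any():
            col[nans] = np.interp(np.where(nans)[0], np.where(~nans)[0], col[~nans])
    return x

# tiny example: two stations, one gap at (t=0, station=1)
A = np.array([[0.0, 1.0], [1.0, 0.0]])
x = np.array([[1.0, np.nan], [2.0, 3.0]])
filled = impute(x, A)   # the gap is filled from the neighbouring station
```

The spatial pass runs first so that the temporal interpolation only has to bridge time steps where every station was missing at once.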

5. Correlation Analysis

Once the data was imputed and cleaned, a correlation analysis was run to see whether any features should be removed due to strong correlation. This analysis was done both between nodes and within each node. It found that dwpf and feel both had a strong correlation with the target variable tmpf. This makes logical sense, as both dew point and feels-like temperature quite literally use temperature to calculate their values. As such, these two features were removed.
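
The within-node check can be sketched as follows; the 0.9 cutoff and the synthetic feel-like series are assumptions for illustration, since the report only states "strong correlation":

```python
import numpy as np

def drop_correlated(data: dict, target: str = "tmpf", thresh: float = 0.9):
    """Drop any feature whose absolute Pearson correlation with the
    target exceeds `thresh` (assumed cutoff)."""
    y = np.asarray(data[target])
    keep = {}
    for name, vals in data.items():
        if name == target:
            keep[name] = vals
            continue
        r = np.corrcoef(y, np.asarray(vals))[0, 1]
        if abs(r) <= thresh:
            keep[name] = vals
    return keep

t = np.linspace(0, 10, 50)
data = {
    "tmpf": np.sin(t),
    "feel": np.sin(t) * 1.01 + 0.1,   # near-duplicate of tmpf, like 'feel'
    "sknt": np.cos(3 * t),            # only weakly related to tmpf
}
kept = drop_correlated(data)
print(sorted(kept))  # ['sknt', 'tmpf']
```

The between-node version of the check is the same computation run on the same feature across pairs of stations.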

Following the within-node correlation, features were then compared between nodes, which showed whether it made any sense to consider the spatial relationships in the data. As shown below, there is a very strong correlation between all nodes regardless of feature.

6. Scaling

Following all of the above steps, the final dataset was reduced to 5 features across 7 stations with 4,381 six-hour time steps. This dataset was then scaled using scikit-learn's RobustScaler.
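
With default settings, RobustScaler centres each feature on its median and divides by its interquartile range, which can be reproduced directly:

```python
import numpy as np

def robust_scale(x: np.ndarray) -> np.ndarray:
    """Centre on the median and divide by the interquartile range,
    per feature column -- the transform scikit-learn's RobustScaler
    applies with default settings."""
    med = np.median(x, axis=0)
    q1, q3 = np.percentile(x, [25, 75], axis=0)
    return (x - med) / (q3 - q1)

# one feature column with a single outlier
x = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])
scaled = robust_scale(x)  # the outlier no longer dominates the scale
```

Using median and IQR instead of mean and standard deviation keeps occasional extreme readings (gusts, sensor spikes) from compressing the bulk of the data, which is why a robust scaler is a reasonable choice here.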

shape: (30_667, 6)
┌─────────┬───────────┬──────────┬───────────┬───────────┬───────────┐
│ station ┆ tmpf      ┆ relh     ┆ sknt      ┆ drct_sin  ┆ drct_cos  │
│ ---     ┆ ---       ┆ ---      ┆ ---       ┆ ---       ┆ ---       │
│ cat     ┆ f64       ┆ f64      ┆ f64       ┆ f64       ┆ f64       │
╞═════════╪═══════════╪══════════╪═══════════╪═══════════╪═══════════╡
│ GCK     ┆ -1.355721 ┆ 0.481483 ┆ -0.080357 ┆ 0.851117  ┆ 0.524977  │
│ LBL     ┆ -1.265174 ┆ 0.52015  ┆ 0.160714  ┆ 0.664796  ┆ 0.747025  │
│ EHA     ┆ -1.168491 ┆ 0.633828 ┆ -0.285714 ┆ 0.970763  ┆ 0.24004   │
│ HQG     ┆ -1.205638 ┆ 0.484683 ┆ -0.223214 ┆ 0.936332  ┆ 0.351115  │
│ 3K3     ┆ -1.241791 ┆ 0.531278 ┆ -0.383929 ┆ 0.981255  ┆ 0.192712  │
│ …       ┆ …         ┆ …        ┆ …         ┆ …         ┆ …         │
│ EHA     ┆ -0.368491 ┆ 0.349556 ┆ -0.991072 ┆ -0.533205 ┆ -0.845986 │
│ HQG     ┆ -0.417247 ┆ 0.322311 ┆ -1.205357 ┆ 0.977334  ┆ -0.211704 │
│ 3K3     ┆ -0.432836 ┆ 0.313778 ┆ -0.741072 ┆ -0.700217 ┆ -0.71393  │
│ JHN     ┆ -0.417579 ┆ 0.385544 ┆ -0.705357 ┆ -0.824675 ┆ -0.565607 │
│ 19S     ┆ -0.441128 ┆ 0.365522 ┆ -0.491072 ┆ -0.824675 ┆ 0.565607  │
└─────────┴───────────┴──────────┴───────────┴───────────┴───────────┘
MSE: 0.0562

MSE: 0.0147

Modeling and Results


Conclusion


References

Administration, Federal Aviation. 2021. “Automated Surface Observing System.” https://www.weather.gov/media/asos/aum-toc.pdf.
Herzmann, Daryl. 2023. “IEM:: Download ASOS/AWOS/Metar Data.” Iowa Environmental Mesonet. https://www.mesonet.agron.iastate.edu/request/download.phtml?network=KS_ASOS.